GrabURL 2.0 - Selective URL fetching utility
Copyright (C) 1996-97 Serge Emond
----------------------------------------------------------------------------
This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
----------------------------------------------------------------------------
Table of contents
~~~~~~~~~~~~~~~~~
1. This document
2. Config file
3. Arguments
4. Configuration example
5. Url Completion
6. Filenames
7. One or 2 examples
8. How to contact the author
1. This document
~~~~~~~~~~~~~~~~
Many things may be incorrect or missing in this document. If you really
want accurate information, read the source code.
2. Config file
~~~~~~~~~~~~~~
Lines with '#' as the first non-whitespace character are comments.
Arguments to commands can be enclosed in " or ' so that they may contain spaces.
2.1. Global commands
2.1.1 Section
Use: Section <SectionName>
You need a matching End command for each section. A section is a part of
the config file that is processed by a different part of the program. Right
now there are 2 known sections: "http" and "scan". Each section has its own
set of commands and is processed independently. There are also "global"
options that sit outside of any Section.
2.1.2 End
Use: End
You need one to mark the end of a section and return to the "global" part.
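As a sketch, the overall layout of a config file looks like this (Delay and
the "http" section commands are described later in this document; the
indentation is only for readability):
  Delay 10
  Section "http"
    # http-specific commands go here
  End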
2.1.3 Include
Use: Include <filename>
Includes another config file, which is processed exactly the same way as
the current one.
Example: if several of your config files share a common part, put that
common part in its own file and Include it at the beginning of each
specialized config file.
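For instance, a specialized config might start like this (the filename
"common.rc" is only illustrative):
  Include "common.rc"
  Delay 20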
2.1.4 Delay
Use: Delay <time>
This makes GrabURL pause between each file download, to give your link and
the Internet in general a break. The argument is in tenths of a second.
It is highly recommended to use some delay (e.g. 10 or 20) for downloads
that might recurse far, especially if you have a fast Internet
connection. Harassing a host is not nice for them and can get you
site-banned. Be nice and share the net!
Default value: 0
2.1.5 EMail
Use: EMail <your email>
This is the email address sent to the remote server with each request.
If you don't specify any EMail keyword, NO email information is sent.
You should send this information so that site operators can email you if
you do something wrong, instead of banning you from their site outright.
It could also prevent a blanket GrabURL ban (refusing all connections made
with graburl).
2.1.6 SaveRoot
Use: SaveRoot <root directory>
This tells GrabURL where to save the files. You can use "." (or . or '.')
to save in the current directory. The default value is the current directory.
2.1.7 DirMode
Use: Dirmode <mode>
This is the mode (protection bits) for the directories GrabURL creates.
See the mkdir(2) man page for more information about modes. The mode is an
octal number. The default value is 700 (octal), which means the protection
bits rwx------. The first digit is the user bits, the second the group bits
and the third the others bits. Each digit is the sum of: 1 for the execute
bit (x), 2 for the write bit (w) and 4 for the read bit (r). The mode of
existing directories is NOT modified.
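For instance, to let your group list and enter the created directories
while keeping them closed to everyone else, you could use this non-default
mode (shown only as a worked example):
  DirMode 750
750 means user rwx (4+2+1=7), group r-x (4+1=5) and others nothing (0).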
2.1.8 FileMode
Use: Filemode <mode>
This affects files the same way DirMode affects directories. The argument
also has the same meaning. If a file gets overwritten, its mode WILL be
modified. Default is 700 (octal).
2.1.9 Translate
Use: Translate <from> <to>
This modifies file and directory names. The first character of "from"
is changed to the first character of "to", the 2nd character of "from"
becomes the 2nd character of "to", and so on.
"from" and "to" MUST have the same length.
Example: translate "~%()" "__++"
If you download http://moo.com/~hey/Puppet(7).html, instead of saving it
to "moo.com/~hey/Puppet(7).html" in the SaveRoot-directory, it will save
it to "moo.com/_hey/Puppet+7+.html".
2.1.10 Retry
Use: Retry
When present, GrabURL will retry ONCE every URL that failed to be
downloaded. I'm not even sure it is completely implemented. (!)
2.1.11 Scan
Use: Scan
When present, GrabURL will scan all HTML files and add the urls found in
them to the download-list. A file is considered to be HTML if:
1- the server says it is HTML, or
2- its filename contains ".htm" (ie "bobo.htmt.jpg" will be scanned)
2.1.12 StayInHost
Use: StayInHost
When this is in the config file, and a URL is flagged SCAN (ie with the
SCAN config command, the SCAN command-line option or in the workfile),
GrabURL will only add the urls found while scanning that file if their
host name is the SAME as the host name of the scanned file.
Example: you scan the result of http://hop.com/. http://hop.com/2.gif
WILL be added to the list, but http://www.yahoo.com/ won't.
2.1.13 NonExists
Use: NonExists
No files get overwritten on disk. If the file already exists, the
url is skipped.
2.1.14 Depth
Use: Depth <depth>
Each file has a "depth", which is an integer. If a file is HTML and is
scanned, each file added to the list from it gets that file's depth minus
one (ie x.html has depth 4 and is scanned, so all new files will have a
depth of 3).
A file with depth 0 will NOT be scanned.
A file with depth -1 gives a depth of -1 to its children and WILL be
scanned.
The default is -1 (scan all of the Internet).
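As a worked example (the filenames are hypothetical, and assuming the
starting URL gets the configured depth), with "Depth 2" in the config file:
  start.html    depth 2, scanned
  a.html        depth 1, scanned (added from start.html)
  b.html        depth 0, downloaded but NOT scanned
Recursion therefore stops two levels below the starting URL.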
2.1.15 AddLog
Use: AddLog <filename> [<optional flags>]
This adds a log to which text information will be appended while GrabURL
runs.
Filename is simply the file to append the information to. There are 2
special filenames: stdout and stderr, which send output to the standard
output and standard error streams, respectively.
Options are:
  date   add a date stamp
  time   add a time stamp
  type   add the type of output
  #      all characters from # to EOL are ignored (ie a comment)
For example,
AddLog "stderr" type
AddLog "/var/log/GrabURL.log" type date time
will do 2 things: send output to /var/log/GrabURL.log like this:
> 01/00/98 00:06:49 [1/1] http://localhost/
> 01/00/98 00:06:49 Received 292 bytes in 0 sec
and print this on the screen, via the error stream:
> [1/1] http://localhost/
> Received 292 bytes in 0 sec
Types:
  ?   information
  *   error
  +   scan stuff (additions to the list)
  >   action, like "Receiving file"
  X   debug information (there is not much debugging right now, and GrabURL
      has to be compiled with GU_DEBUG)
2.2 Section HTTP
The following options have to be enclosed between
Section "http"
and
End
2.2.1 Auth
Use: Auth <realm> <userpw>
This is for password-protected HTTP files. The only authentication scheme
implemented is the only official one I know of: "basic" (see rfc1945).
"realm" is a code associated with the protected page(s) that lets clients
(graburl, netscape, ...) know which password to use for that page. If you
don't know the realm of a password-protected page, simply try to get it;
GrabURL will tell you what the realm is.
"userpw" has 2 parts: your user name and your password, separated by a
colon. That part is "encrypted" using base64 and sent directly to the
server. (So anyone who intercepts your request can learn your
password ;)
Example config-file:
auth flyers 'joe:foo'
Auth "MoNgOlfier SeX life" james:tkirk
If GrabURL tries to get a protected page and the server says the realm
is "flyers", the user name "joe" with password "foo" is used.
If the realm is "MoNgOlfier SeX life", user/password used are
james/tkirk.
Try to use the right case for the letters ;)
Can be used more than once.
2.2.2 RequestLine
Use: RequestLine <line>
This adds the line <line>, exactly as you typed it, to ALL HTTP requests.
A CR/LF is appended.
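For example, the line below (it also appears in the configuration example
of section 4) adds an Accept header to every request:
  RequestLine "Accept: */*"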
2.3 Section SCAN
The following commands have to be enclosed between
Section "scan"
and
End
2.3.1 Key
Use: Key <key>
What I call a "key" (keyword) is a word in the HTML language that indicates
that a URL follows. In an HTML file it will look like this, for a key
named "src":
<key key key src ../hop.jpg key key>
<key key src=../hop.jpg key>
<key key src="../hop.jpg" key>
etc...
This will add the relative url ../hop.jpg to the list.
I know of 3 keys for now:
SRC which is used for images and frames
BACKGROUND which is used for... the background image
HREF which is used to link to other images, HTML pages, email addresses or
whatever.
I strongly suggest using this in your config file:
key SRC
key BACKGROUND
key HREF
2.3.2 Ignore
Use: Ignore <begin> [<end>]
Works a bit like "key". When the scanner encounters a tag containing
"begin", it will skip EVERYTHING up to the moment it encounters the
matching "end" tag. If "end" is not specified, the skipping stops when
the "begin" tag is encountered again.
2.3.3 Diese
Use: Diese <search|normal>
A diese "#" (number sign) means, in the "url language", to search for a
label in an html file after displaying it. For example,
http://hey.com/hop.html and
http://hey.com/hop.html#contents
both point to the SAME file, "http://hey.com/hop.html". In the second
one, however, your browser (ie netscape) will search for a label named
"contents" and position the page accordingly.
When "diese search" is used, the part beginning with "#" is stripped from
the name and the "#" acts NORMALLY, ie it means to search in the page.
"diese normal" makes the "#" a normal character, and the request for the
web page will be sent WITH the "#".
You should always use "diese search", which is the default, unless you
have a really specific and private use for the alternative.
2.3.4 Skip
Use: Skip <regular expression>
This is a regular expression that, if matched, causes the to-be-added
URL to be rejected and NOT added. You can use as many Skip regular
expressions as you wish, and if a single one of them matches, the url
is rejected.
See "Pattern" below for an example.
2.3.5 Pattern
Use: Pattern <regular expression>
This is a regular expression that MUST be matched for GrabURL to add a
URL to the list. You can use as many as you want, and ALL of them will
have to match for an url to be appended to the list.
Example:
Pattern .jpe*g$
Skip foo
When an HTML file is scanned, the urls found in it will only be appended
to the download-list if:
1) "foo" is NOT in the url name
2) the url ends with .jpg, .jpeg, .jpeeg, .jpeeeg, ....
ie,
http://www.pod.ca/andrew.html rejected
http://www.foo.org/hey.jpeg rejected: contains "foo"
http://www.bar.org/hey.jPeEg accepted
3. Arguments
~~~~~~~~~~~~
Arguments are the information you pass to graburl via the command line.
ALL arguments not matching one of the following are considered URLs to
be added to the download-list.
3.1 Help
Arg: help
Syn: ?, -h, --help
Gives a summary of the commands supported by GrabURL.
3.2 License
Arg: license
Shows the copyright notice and information about the warranty, etc. (GPL)
3.3 Config
Arg: Config <config_file>
Syn: c <config_file>
Use this config file. When that option is not specified, the search order
is ".graburlrc" in your home directory, then "/etc/graburlrc".
3.4 SaveRoot
Arg: SaveRoot <root_dir>
Syn: sr <root_dir>
Specifies the root directory. See SaveRoot in the config file for more
information.
3.5 InFile
Arg: Infile <file>
Syn: if <file>
Points to a file containing one URL per line. You can specify multiple
'infiles'.
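An infile is just a plain-text list, for example (the URLs are
hypothetical):
  http://hop.com/
  http://hop.com/vip/
  www.umontreal.ca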
3.6 WorkFile
Arg: Workfile <file>
Syn: wf <file>
Points to a work file in which each url is "saved" with its options and
status. When GrabURL READS the file, it can simply be one URL on each
line, like for the InFile option. Basically, each line looks like this
(a sample workfile is sketched after the option list below):
<type> <url> [<options>]
Type can be:
  Q       Queued - not processed/downloaded yet
  R       Received - it's on your disk now :)
  F       Failed - will be retried if you specified the RETRY option
  X       Fatal error - the document doesn't exist, ...
  M       Moved - the server generally gives a new URL and GrabURL will
          automatically add it to the list
  (else)  Skip that url, do nothing with it
An example of a moved url: if you try to get http://hop.com/vip and vip is
a directory, the server generally responds with a "moved" error and gives
GrabURL the new location "http://hop.com/vip/".
The URL is the url itself, enclosed or not in " or '.
Options are:
  TO=filename       filename to save the file to
  D=depth           depth of this file, if not -1
  RETRY             retry flag for this url
  NODIRS or ND      do not create sub-directories for this url
  NONEXISTS or NE   do not download if a file of that name is on disk
  SCAN              scan this file, if HTML, for recursion
  STAYONHOST        same as the StayInHost config option
  IFMODIFIED        send an IfModified request to the server
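As a sketch (the urls are hypothetical, and the exact spacing GrabURL
writes may differ), workfile lines following the <type> <url> [<options>]
layout look like this:
  Q http://hop.com/pics/ D=2 SCAN STAYONHOST
  R http://hop.com/index.html
  F http://hop.com/missing.gif RETRY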
3.7 SaveTo
Arg: SaveTo <file>
Syn: st <file>
Saves the url to the given file. You may not specify a workfile or more
than one url when you use this option on the command line.
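For example, to fetch a single file into a name of your choice (the url
is illustrative):
  graburl http://hop.com/logo.gif st logo.gif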
3.8 Retry
Arg: retry or noretry
Sets/unsets the RETRY flag.
3.9 Ifmodified
Arg: IfModified or NOIfModified
Syn: im or noim
Sets/unsets the IfModified flag. IfModified is a request sent to the
server giving the date of the file on disk, if we already have it.
If the server supports that option and our file is newer than theirs (ie
the file hasn't been modified), the server tells us it is not modified
and doesn't send the file.
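For example, to re-fetch a page only if the server's copy has changed
since the one already on disk (same host as in the examples of section 7):
  graburl www.umontreal.ca im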
3.10 NonExists
Arg: NonExists or NONonExists
Syn: ne or none
Sets/unsets the NO_EXISTS flag. If set, with "NonExists" on the command
line or in the config file, files don't get overwritten and the url is
skipped.
3.11 NoDirs
Arg: NoDirs or NONoDirs
Syn: nd or nond
Sets/unsets the nodir flag. See section 6 (Filenames) below for how files
are saved to disk.
3.12 SaveHeader
Arg: SaveHeader or NOSaveHeader
Syn: sh or nosh
See config file.
3.13 Scan
Arg: scan or Recursive
Syn: r
Scan files for recursive downloads. Use NOSCAN to turn it off if you
set it via the config file.
3.13.1 StayInHost
Arg: StayInHost
Syn: sih
The scanner only adds an url if it has the same hostname as its parent.
3.13.2 Pattern
Arg: Pattern <regexp>
Syn: p <regexp>
See config file. You may have multiple patterns.
3.13.3 SkipPattern
Arg: SkipPattern <regexp>
Syn: sp <regexp>
See config file. You may use multiple skip patterns.
3.13.4 Depth
Arg: Depth <depth>
Syn: d <depth>
See config file.
4. Configuration example
~~~~~~~~~~~~~~~~~~~~~~~~
---------- Begins here
# This is a comment!!
Translate "~%()" "____"
SaveRoot .
FileMode 0700
DirMode 0700
Delay 5
AddLog "stderr" type
AddLog "GrabURL.log" type date time
Section "http"
# Realm yop, user grey, password moppe
Auth yop grey:moppe
RequestLine "Accept: */*"
End
Section "scan"
key SRC
key BACKGROUND
key HREF
ignore BLOCKQUOTE /BLOCKQUOTE
diese search
End
----------- EOF
5. Url Completion
~~~~~~~~~~~~~~~~~
For now, GrabURL only understands HTTP. I wanted to add FTP, but I don't
have the time or the need for it.
If you give GrabURL (on the command line, or in a workfile or infile) a
name alone, like "www.yahoo.com", graburl will complete it to
"http://www.yahoo.com/".
The scanner will also encounter relative urls in HTML files. Those urls
are not complete; they refer to a file relative to the current document.
For example, the url "../imgs/banner.jpg" in the file coming from
"http://hey.com/carlton/index.html" points in reality to the file
"http://hey.com/imgs/banner.jpg".
GrabURL understands names beginning with "/", containing "." and/or "..",
and will modify the url accordingly.
6. Filenames
~~~~~~~~~~~~
Before creating a file, GrabURL changes its working directory to the
"saveroot" directory. It then creates a directory with the name of the
host and cd's into it. Then it creates all the sub-directories named in
the url and puts the file there.
For example, if saveroot is "/tmp", the url http://buz.com/~hey/ho.html
will be saved to /tmp/buz.com/~hey/ho.html.
If you have a translation for "~" to "_", it will be saved as
/tmp/buz.com/_hey/ho.html.
If the option "nodirs" is specified, the filename will be /tmp/ho.html.
If an url ends with "/", the filename will be "index.html".
7. One or 2 examples
~~~~~~~~~~~~~~~~~~~~
graburl www.umontreal.ca
-> will download http://www.umontreal.ca/
graburl www.umontreal.ca r sih
-> Downloads the complete site of Université de Montréal.
graburl www.umontreal.ca r p www.umontreal.ca
-> same as the previous one
graburl www.umontreal.ca r p www.umontreal.ca sp .gif$
-> same as above, but doesn't download files ending with ".gif".
graburl www.umontreal.ca r sih wf hey.wf
- stop the transfer after 2-3 files, then type
graburl wf hey.wf
-> Starts to download recursively as in the 2nd example; the user aborts
the download, then resumes it.
8. How to contact the author
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
email: Serge Emond <greyl@videotron.ca>
Serge Emond <ei971807@uqac.uquebec.ca>
smail: Serge Emond
3392 des Anemones
Jonquiere, Quebec
Canada, G7S 5V4
You can also check the official GrabURL web page:
http://pages.infinit.net/greyl/graburl/
It doesn't contain much, but you can download graburl from there.
Please don't write things like this:
"It only downloads the first file and doesn't do recursion, why?"
And please have a look at the README.txt file that came with this.